Urine Metabolomics Pancreatitis QC/QA

Author

Robert M Flight

Published

2024-06-27 10:47

Purpose

Quality control / quality analysis (QC/QA) report of the urine metabolomics from pancreatitis samples. There is an Executive Summary at the end of this report.

This report should have been supplied as two versions:

  1. A Word document.
  2. A self-contained HTML file with possibly some interactive graphics.

Data

Data consists of urine measured metabolites, and various sample metadata.

Methods

We analyzed all the data using R v 4.3.0 (R Core Team 2021). Data were read in using readxl or readr depending on source. For metadata, extra metadata was joined to basic metadata supplied using the processing_id. Samples were normalized by osmolarity alone, median abundance, or combination of osmolarity and median abundance.

Sample-sample correlations were calculated using information-content-informed Kendall-tau, a modification of Kendall-tau correlation to allow the inclusion of missing values (Robert M. Flight, Bhatt, and Moseley 2022). Almost all of the analysis below uses the intensities directly provided, so the correlations here should correspond almost 1:1 to a normal Kendall-tau.

Statistical tests of sample metadata to sample principal component scores were performed using ANOVA tests implemented in the visualizationQualityControl package v 0.4.10 (Robert M. Flight and Moseley 2021).

We determined the limit of detection (LOD) across all metabolites from the mean and standard deviation of each metabolite in the pooled samples. Generating bins of 0.05 across the mean, we calculated the standard deviation of the standard deviation (SDoSD) of the mean values within each bin, and examine the plot to determine where the SDoSD begines to increase.

All Data

To start, we examine correlation and principal component analysis grouping using all of the data, including the pooled quality-control samples provided by the metabolomics core.

Let’s first double check how many of each disease group of samples we have. The counts are shown in Table 1. Tables 2, 3 show the breakdown by race and gender, respectively.

Table 1. Number of samples in each disease group.

cohort_group cohort_c n_samples
Green I 1 40
Pooled 0 15
Red 6 40

Table 2. Number of samples by race.

race_pt n_samples
Asian 3
Black or African American 14
Do not know 1
More than one race 1
White 61
NA 15

Table 3. Number of samples by gender.

gender n_samples
F 45
M 35
NA 15

Need for Normalization

We can check whether samples need more than osmolarity normalization by examinging boxplots of the metabolite intensity distribution before and after normalization. These are shown in Figure 1. As shown, we don’t think that using osmolarity alone is enough for normalization for these samples. Interestingly, median or osmolarity+median normalization give the same intensities.

Figure 1. Metabolite log(intensity) boxplots for no normalization (none), median normalization (median), osmolarity normalization, or both median and osmolarity normalizatin (osmo-med), with samples colored by which disease group they belong to (cohort_c).

ICI-Kt

For all sample-sample pairs, we calculate the ICI-Kt correlation (see Methods). These correlations are shown as a heatmap in Figure 2. The way to read this heatmap is that the correlation values are encoded as a color, and each square is the correlation of each sample with another sample. So row 1, column 2 (and column 1, row 2) represents the correlation of pooled sample 1 to pooled sample 2. The disease group of each sample is encoded by the colors along the rows and columns of the heatmap. The ordering of the samples in the heatmap is decided by treating the correlation as a similarity (1 - correlation), and then clustering them using hierarchical clustering. Therefore, ideally we would see groups of high correlating samples that also group by their disease status.

Figure 2. ICI-Kt correlation heatmap of all samples, ordered by similarity (1 - correlation) and colored by disease group.

This, we admit is not what we hope to see. There are not large groups of high correlations that also correspond to disease group. In fact, there doesn’t seem to be anything with really high correlation outside of the pooled replicates. The highest correlations should be just off the diagonal, i.e. the correlation with the nearest neighbor, except where there are bigger differences. Let’s see what that distribution looks like, in Figure 3.

Figure 3. Histogram of the direct neighbor ICI-Kt correlations (just off the diagonal) from Figure 2.

So a mean value of 0.62, which isn’t stupendous, but also not too bad either. The real problem is that green and red samples have comparatively high correlations with each other. For example, the ICI-Kt correlation of “lusczek_138” (red) with “lusczek_030” (green) is 0.596.

We can confirm what this really looks like by plotting their actual intensities against each other too, shown in Figure 4.

Figure 4. Plot of the log-intensities of lusczek_138 (red sample) vs lusczek_030 (green sample).

PCA

We double check all of the above using principal components analysis (PCA), which operates slightly differently than the ICI-Kt correlation. Figure 5 shows the first two principal components on the osmolarity+median normalized data, with samples colored by disease group.

Figure 5. PCA plot of samples colored by their disease group.

Green & Red Only

Looking at Figure 5, we can see that all of the pooled samples cluster right in the middle of the plot, and in fact are just off the 0, 0 point on each of PC1 and PC2. It is possible that they are messing with things a bit. Therefore, we restrict the analysis to just green I and red samples (cohort_c values of 1 and 6, control and chronic pancreatitis, respectively).

ICI-Kt

Figure 6. ICI-Kt heatmap of green I and red samples only, samples arranged by their sample - sample similarity.

When grouped by the cohort group, there is nothing obviously wrong in the sample - sample correlations, as shown in Figure 7.

Figure 7. ICI-Kt heatmap of green I and red samples only, samples arranged by their sample - sample similarity within each cohort of samples.

Outlier Samples

We can also use the median correlations and feature intensity distributions within each cohort of samples to check if any are outlier samples that should be removed prior to differential analysis. Figure 8 shows that there is a single sample in cohort 1 (green I) that should be removed.

Figure 8. Sample scores, and indication of whether the sample should be considered an outlier based on the combination of metabolite intensity outlier fraction and sample - sample median correlations.

Figure 9. ICI-Kt heatmap of Control and CP samples only, samples arranged by their sample - sample similarity. Colors indicate the class of sample, as well as whether the sample was detected as an outlier.

PCA

Figure 10. PCA of green and red samples only, after removing the outlier sample. Notice it still requires PC3 to separate them.

We can also plot the loadings of the metabolites and see what is going on here.

Figure 11. PCA loadings of metabolites on PC1 and PC3. Metabolites with a significant loading value for either of PC1 or PC3 (p-value <= 0.05) are shown in red and labeled.

Figure 12. Score plots of PC1 vs PC3, PC1 vs PC2, PC2 vs PC3, and the loadings of metabolites on PC1 vs PC3.

PC Scores vs Attributes

We tested the sample attributes against their principal component scores. PC3 shows up as being very important, but interestingly, it seems to have the same p-value for both cohort_group and year. When we check the disease group by year, we see that green I (1) and red (6) are perfectly confounded by year.

Table 4. ANOVA associations of various sample attributes and principal components.

PC p.value statistic df
cohort_group
PC3 1.36 × 10−4 16.13 1.00
PC1 2.29 × 10−4 14.95 1.00
PC53 5.60 × 10−2 3.77 1.00
diabetes_bl
PC3 2.28 × 10−8 22.38 2.00
PC1 2.74 × 10−4 9.16 2.00
PC4 7.36 × 10−2 2.70 2.00
disease_cohort
PC3 1.36 × 10−4 16.13 1.00
PC1 2.29 × 10−4 14.95 1.00
PC53 5.60 × 10−2 3.77 1.00
drink_combo
PC77 3.20 × 10−3 3.94 5.00
PC9 5.73 × 10−3 3.60 5.00
PC4 1.93 × 10−2 2.89 5.00
etiology
PC3 2.53 × 10−4 7.22 3.00
PC1 1.36 × 10−3 5.74 3.00
PC18 2.08 × 10−2 3.45 3.00
etoh_etiology
PC3 6.68 × 10−5 10.94 2.00
PC1 4.35 × 10−4 8.58 2.00
PC18 1.63 × 10−2 4.35 2.00
gender
PC51 2.81 × 10−3 9.53 1.00
PC24 2.87 × 10−2 4.97 1.00
PC1 2.96 × 10−2 4.91 1.00
race_pt
PC29 3.06 × 10−3 4.39 4.00
PC17 3.64 × 10−3 4.27 4.00
PC58 2.13 × 10−2 3.07 4.00
site
PC25 1.08 × 10−2 4.80 2.00
PC21 2.22 × 10−2 4.01 2.00
PC17 3.66 × 10−2 3.46 2.00
year
PC3 1.36 × 10−4 16.13 1.00
PC1 2.29 × 10−4 14.95 1.00
PC53 5.60 × 10−2 3.77 1.00

Table 5. ANOVA results for PC1 and PC3 only.

variable PC p.value statistic df
cohort_group PC3 1.36 × 10−4 16.13 1.00
year PC3 1.36 × 10−4 16.13 1.00
cohort_group PC1 2.29 × 10−4 14.95 1.00
year PC1 2.29 × 10−4 14.95 1.00

Table 6. Number of samples by disease group and year. Highlighted group 1 (green I) and group 6 (red).

cohort_c 2017 2019
1 0 39
6 40 0

We can also return the statistics for all the things significant for PC1 - PC3, as well as all the values for cohort_group, as shown in Table 7.

Table 7. Statistics and p-values for ANOVA tested attributes against sample scores for PCs 1 - 3.

variable PC p.value statistic
cohort_group PC1 2.29 × 10−4 14.95
diabetes_bl PC1 2.74 × 10−4 9.16
etoh_etiology PC1 4.35 × 10−4 8.58
etiology PC1 1.36 × 10−3 5.74
drink_combo PC1 2.12 × 10−2 2.84
gender PC1 2.96 × 10−2 4.91
drink_combo PC2 3.04 × 10−1 1.23
etiology PC2 3.37 × 10−1 1.14
diabetes_bl PC2 3.88 × 10−1 0.96
gender PC2 6.51 × 10−1 0.21
etoh_etiology PC2 6.78 × 10−1 0.39
cohort_group PC2 8.21 × 10−1 0.05
diabetes_bl PC3 2.28 × 10−8 22.38
etoh_etiology PC3 6.68 × 10−5 10.94
cohort_group PC3 1.36 × 10−4 16.13
etiology PC3 2.53 × 10−4 7.22
drink_combo PC3 9.46 × 10−2 1.96
gender PC3 6.31 × 10−1 0.23

Diabetes NA Samples

Can we assume that the patients labeled as N/A for diabetic status are really No? As shown in Table 8, these all seem to occur in either Green I / II, which makes sense, as there would be no reason to test for diabetes in a patient with no known pancreatic issues. Figure 13 labels the red-green PCA plot with diabetic status instead. Although they are not perfectly overlapped, the N/A and No groups are much more similar and overlapped than the Yes group, implying that N/A actually means No. I think we are safe to impute the N/A values with No.

This is useful to add “diabetes_bl” and “gender” as covariates to test and alternatively control for in multi-way ANOVA tests.

cohort_c diabetes_bl n_sample
0 NA 15
1 N/A 39
1 Yes 1
6 No 16
6 Yes 24

Figure 13. Comparison of diabetic status using PC1 and PC3.

Check Metabolites With High Loadings

For each principal component (PC), we can compare the loadings of each metabolite in that PC to all the other metabolite loadings in other PCs, i.e. a null distribution. Therefore, the p-values reported here for each PC should be correct, with no need for multiple testing correction.

Table 9. P-values of each metabolite feature (feature_id) with a p-value for each PC.

metabolite p.value
PC1
mannose 6.27 × 10−4
6-deoxyglucitol 5.02 × 10−3
31764 9.29 × 10−3
892 1.14 × 10−2
34108 1.16 × 10−2
10176 1.31 × 10−2
aconitic acid 1.46 × 10−2
acetaminophen glucuronide 2.40 × 10−2
109423 2.46 × 10−2
18242 2.55 × 10−2
32528 2.85 × 10−2
2-deoxytetronic acid 3.61 × 10−2
345452 4.18 × 10−2
4-hydroxyphenylacetic acid 4.27 × 10−2
tagatose 4.47 × 10−2
41924 4.47 × 10−2
85651 4.52 × 10−2
dehydroascorbic acid 4.88 × 10−2
PC2
4529 5.49 × 10−4
34172 5.88 × 10−4
quinic acid 7.84 × 10−4
32528 2.82 × 10−3
34169 3.80 × 10−3
33993 4.08 × 10−3
2031 5.45 × 10−3
215881 7.61 × 10−3
furoylglycine 1.12 × 10−2
mannose 1.18 × 10−2
134464 1.38 × 10−2
14778 1.58 × 10−2
dehydroascorbic acid 1.82 × 10−2
3-(3-hydroxyphenyl)-3-hydroxypropionic acid 1.85 × 10−2
34007 1.91 × 10−2
163829 2.32 × 10−2
109423 3.11 × 10−2
474645 3.36 × 10−2
31564 3.40 × 10−2
3-hydroxy-3-(4'-hydroxy-3'-methoxyphenyl)propionic acid 3.42 × 10−2
109393 3.81 × 10−2
12444 4.20 × 10−2
PC3
109423 3.53 × 10−3
glucose 4.35 × 10−3
4983 5.84 × 10−3
mannose 1.22 × 10−2
171295 1.25 × 10−2
ascorbic acid 1.37 × 10−2
33468 1.38 × 10−2
phosphate 1.74 × 10−2
1-methylgalactose 2.66 × 10−2
109393 2.86 × 10−2
furoylglycine 3.05 × 10−2
glucosamine 3.62 × 10−2
354281 3.87 × 10−2
phenol 4.10 × 10−2
519883 4.16 × 10−2
acetaminophen glucuronide 4.90 × 10−2
dehydroascorbic acid 4.94 × 10−2
1,2-anhydro-myo-inositol 4.96 × 10−2
PC4
109423 3.92 × 10−5
alloxanoic acid 1.37 × 10−3
4983 4.90 × 10−3
171295 7.21 × 10−3
acetaminophen glucuronide 8.94 × 10−3
121305 1.18 × 10−2
33468 1.33 × 10−2
phosphate 1.62 × 10−2
31288 2.63 × 10−2
354281 2.73 × 10−2
homovanillic acid 2.74 × 10−2
34025 2.85 × 10−2
34022 4.04 × 10−2
3-(3-hydroxyphenyl)-3-hydroxypropionic acid 4.09 × 10−2
citric acid 4.70 × 10−2

Any Changes Using a Limit of Detection?

All of the above was done using the data directly provided by the metabolomics core, without searching for a lower limit of what is actually detectable (limit of detection, LoD). Just to double check if setting a LoD made an impact, we did use the standard deviation to mean relationship of metabolites measured in the pooled replicates to estimate a lower limit to be 0.2 for the osmolarity - median normalized data, as shown in Figure 14. This results in 9.84% or 10.4% of data missing when including or removing the pooled samples, respectively.

However, it does not improve samples grouping together by disease status (not shown).

Figure 14. (A) The mean vs SDoSD for all points where we could calculate a SDoSD. (B) Zoomed view of (A).

Metabolite - Metabolite Ratios

Out of curiosity, what happens if we take all the possible metabolite - metabolite ratios and calculate them?

Normalization

Theoretically, this should remove the need for normalization.

Figure 15. Metabolite - metabolite log-ratios.

ICI-Kt

Green I & Red PCA Table

Here is the information for the PCs from doing PCA on the Green I and Red cohorts only.

pc variance percent cumulative labels
PC1 36.177 0.147 0.147 PC1 (15%)
PC2 18.368 0.075 0.222 PC2 (7.5%)
PC3 14.008 0.057 0.278 PC3 (5.7%)
PC4 12.297 0.050 0.328 PC4 (5%)
PC5 10.793 0.044 0.372 PC5 (4.4%)
PC6 9.644 0.039 0.411 PC6 (3.9%)
PC7 7.898 0.032 0.444 PC7 (3.2%)
PC8 7.548 0.031 0.474 PC8 (3.1%)
PC9 6.793 0.028 0.502 PC9 (2.8%)
PC10 6.394 0.026 0.528 PC10 (2.6%)
PC11 5.633 0.023 0.551 PC11 (2.3%)
PC12 5.392 0.022 0.573 PC12 (2.2%)
PC13 5.050 0.021 0.593 PC13 (2.1%)
PC14 4.707 0.019 0.612 PC14 (1.9%)
PC15 4.609 0.019 0.631 PC15 (1.9%)
PC16 4.148 0.017 0.648 PC16 (1.7%)
PC17 3.897 0.016 0.664 PC17 (1.6%)
PC18 3.798 0.015 0.679 PC18 (1.5%)
PC19 3.604 0.015 0.694 PC19 (1.5%)
PC20 3.453 0.014 0.708 PC20 (1.4%)
PC21 3.330 0.014 0.721 PC21 (1.4%)
PC22 3.121 0.013 0.734 PC22 (1.3%)
PC23 3.038 0.012 0.746 PC23 (1.2%)
PC24 2.758 0.011 0.757 PC24 (1.1%)
PC25 2.690 0.011 0.768 PC25 (1.1%)
PC26 2.509 0.010 0.779 PC26 (1%)
PC27 2.407 0.010 0.788 PC27 (0.98%)
PC28 2.320 0.009 0.798 PC28 (0.94%)
PC29 2.222 0.009 0.807 PC29 (0.9%)
PC30 2.177 0.009 0.816 PC30 (0.88%)
PC31 2.056 0.008 0.824 PC31 (0.84%)
PC32 2.032 0.008 0.832 PC32 (0.83%)
PC33 1.927 0.008 0.840 PC33 (0.78%)
PC34 1.843 0.007 0.848 PC34 (0.75%)
PC35 1.804 0.007 0.855 PC35 (0.73%)
PC36 1.665 0.007 0.862 PC36 (0.68%)
PC37 1.647 0.007 0.868 PC37 (0.67%)
PC38 1.540 0.006 0.875 PC38 (0.63%)
PC39 1.507 0.006 0.881 PC39 (0.61%)
PC40 1.475 0.006 0.887 PC40 (0.6%)
PC41 1.407 0.006 0.892 PC41 (0.57%)
PC42 1.387 0.006 0.898 PC42 (0.56%)
PC43 1.301 0.005 0.903 PC43 (0.53%)
PC44 1.229 0.005 0.908 PC44 (0.5%)
PC45 1.206 0.005 0.913 PC45 (0.49%)
PC46 1.186 0.005 0.918 PC46 (0.48%)
PC47 1.086 0.004 0.923 PC47 (0.44%)
PC48 1.054 0.004 0.927 PC48 (0.43%)
PC49 1.024 0.004 0.931 PC49 (0.42%)
PC50 1.017 0.004 0.935 PC50 (0.41%)
PC51 0.972 0.004 0.939 PC51 (0.39%)
PC52 0.944 0.004 0.943 PC52 (0.38%)
PC53 0.896 0.004 0.947 PC53 (0.36%)
PC54 0.892 0.004 0.950 PC54 (0.36%)
PC55 0.848 0.003 0.954 PC55 (0.34%)
PC56 0.831 0.003 0.957 PC56 (0.34%)
PC57 0.771 0.003 0.960 PC57 (0.31%)
PC58 0.761 0.003 0.963 PC58 (0.31%)
PC59 0.678 0.003 0.966 PC59 (0.28%)
PC60 0.661 0.003 0.969 PC60 (0.27%)
PC61 0.656 0.003 0.971 PC61 (0.27%)
PC62 0.614 0.002 0.974 PC62 (0.25%)
PC63 0.586 0.002 0.976 PC63 (0.24%)
PC64 0.584 0.002 0.979 PC64 (0.24%)
PC65 0.525 0.002 0.981 PC65 (0.21%)
PC66 0.510 0.002 0.983 PC66 (0.21%)
PC67 0.480 0.002 0.985 PC67 (0.2%)
PC68 0.464 0.002 0.987 PC68 (0.19%)
PC69 0.429 0.002 0.988 PC69 (0.17%)
PC70 0.420 0.002 0.990 PC70 (0.17%)
PC71 0.390 0.002 0.992 PC71 (0.16%)
PC72 0.379 0.002 0.993 PC72 (0.15%)
PC73 0.337 0.001 0.995 PC73 (0.14%)
PC74 0.319 0.001 0.996 PC74 (0.13%)
PC75 0.295 0.001 0.997 PC75 (0.12%)
PC76 0.265 0.001 0.998 PC76 (0.11%)
PC77 0.248 0.001 0.999 PC77 (0.1%)
PC78 0.221 0.001 1.000 PC78 (0.09%)
PC79 0.000 0.000 1.000 PC79 (0.00000000000000000000000000000038%)

Executive Summary

  • Disease state is not the biggest driver of variance. Something else is. Possibly gender.
  • This doesn’t mean there isn’t differences in disease states.
  • We were concerned about a confound of year with disease for Green I vs Red.
  • Metabolite ratios may be required to figure out what is actually different.
  • A lower limit of detection can be determined, but at this stage it doesn’t seem to affect anything we’ve examined for QC/QA.

References

Flight, Robert M., Praneeth S. Bhatt, and Hunter NB Moseley. 2022. “Information-Content-Informed Kendall-Tau Correlation: Utilizing Missing Values.” bioRxiv. https://doi.org/10.1101/2022.02.24.481854.
Flight, Robert M, and Hunter NB Moseley. 2021. “visualizationQualityControl: Development of Visualization Methods for Quality Control.” Manual. https://moseleybioinformaticslab.github.io/visualizationQualityControl https://github.com/moseleybioinformaticslab/visualizationQualityControl.
R Core Team. 2021. “R: A Language and Environment for Statistical Computing.” Manual. Vienna, Austria. https://www.R-project.org/.